In this lab, we will apply tree-based classification methods to the Endometrium vs. Uterus cancer data. For documentation see: http://scikit-learn.org/0.17/modules/tree.html
Let us start, as usual, by setting up our environment, loading the data, and setting up our cross-validation.
In [ ]:
import numpy as np
%pylab inline
In [ ]:
# Load the data
# TODO
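A minimal sketch of what this cell could look like, assuming the data come as a CSV file with one sample per row and the class label in a column named 'Tissue' (both the path data/small_Endometrium_Uterus.csv and the column name are hypothetical; adapt them to your copy of the data):
In [ ]:
import pandas as pd
# Hypothetical file name and label column -- adjust to your own data.
df = pd.read_csv('data/small_Endometrium_Uterus.csv')
# Design matrix: every column except the label column
X = df.drop('Tissue', axis=1).values
# Labels: encode the two tissue types as 0/1
y = np.array([1 if tissue == 'Uterus' else 0 for tissue in df['Tissue']])
print(X.shape, y.shape)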
In [ ]:
# Normalize the data
from sklearn import preprocessing
X = preprocessing.normalize(X)
In [ ]:
# Set up a stratified 10-fold cross-validation
from sklearn import cross_validation
folds = cross_validation.StratifiedKFold(y, 10, shuffle=True)
In [ ]:
def cross_validate(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions.
Parameters:
-----------
design_matrix: (n_samples, n_features) np.array
Design matrix for the experiment.
labels: (n_samples, ) np.array
Vector of labels.
classifier: sklearn classifier object
Classifier instance; must have the following methods:
- fit(X, y) to train the classifier on the data X, y
- predict_proba(X) to apply the trained classifier to the data X and return probability estimates
cv_folds: sklearn cross-validation object
Cross-validation iterator.
Return:
-------
pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
"""
pred = np.zeros(labels.shape)
for tr, te in cv_folds:
# Restrict data to train/test folds
Xtr = design_matrix[tr, :]
ytr = labels[tr]
Xte = design_matrix[te, :]
#print Xtr.shape, ytr.shape, Xte.shape
# Fit classifier
classifier.fit(Xtr, ytr)
# Predict probabilities (of belonging to +1 class) on test data
yte_pred = classifier.predict_proba(Xte)
index_of_class_1 = 1 - ytr[0] # 0 if the first sample is positive, 1 otherwise
pred[te] = yte_pred[:, index_of_class_1]
return pred
Question Cross-validate 5 different decision trees, with default parameters.
In [ ]:
from sklearn import tree
from sklearn import metrics
# Use: clf = tree.DecisionTreeClassifier()
ypred_dt = [] # will hold the 5 arrays of predictions (1 per tree)
for tree_index in range(5):
# TODO
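One possible way to fill in the loop above (a sketch, not the only solution); with default parameters the trees can still differ from one repetition to the next because ties between equally good splits are broken at random:
In [ ]:
ypred_dt = []  # the 5 arrays of cross-validated predictions (1 per tree)
for tree_index in range(5):
    # A fresh tree with default parameters for each repetition
    clf = tree.DecisionTreeClassifier()
    ypred_dt.append(cross_validate(X, y, clf, folds))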
Question Compute the mean and standard deviation of the area under the ROC curve of these 5 trees. Plot the ROC curves of these 5 trees.
In [ ]:
fpr_dt = [] # will hold the 5 arrays of false positive rates (1 per tree)
tpr_dt = [] # will hold the 5 arrays of true positive rates (1 per tree)
auc_dt = [] # will hold the 5 areas under the ROC curve (1 per tree)
for tree_index in range(5):
# TODO
for tree_index in range(4):
plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='orange')
plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='orange',
label='DT (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc="lower right")
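A possible completion of the metrics loop, reusing the predictions stored in ypred_dt (sketch); run it before the plotting cell above, or re-run that cell afterwards:
In [ ]:
fpr_dt = []
tpr_dt = []
auc_dt = []
for tree_index in range(5):
    # ROC curve and area under it for each tree's cross-validated predictions
    fpr, tpr, thresholds = metrics.roc_curve(y, ypred_dt[tree_index], pos_label=1)
    fpr_dt.append(fpr)
    tpr_dt.append(tpr)
    auc_dt.append(metrics.auc(fpr, tpr))
print("AUC: %.3f +/- %.3f" % (np.mean(auc_dt), np.std(auc_dt)))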
Question What parameters of DecisionTreeClassifier can you play with to define trees differently than with the default parameters? Cross-validate these using a grid search, and plot the optimal decision tree on the previous plot. Did you manage to improve performance?
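For instance, criterion, max_depth, min_samples_split, min_samples_leaf and max_features all change how the tree is grown. A possible grid to plug into the cell below (the values are only a suggestion):
In [ ]:
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [3, 5, 10, None],
              'min_samples_leaf': [1, 5, 10]}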
In [ ]:
from sklearn import grid_search
param_grid = # TODO
clf = grid_search.GridSearchCV(tree.DecisionTreeClassifier(), param_grid,
scoring='roc_auc')
ypred_dt_opt = cross_validate(X, y, clf, folds)
fpr_dt_opt, tpr_dt_opt, thresholds = metrics.roc_curve(y, ypred_dt_opt, pos_label=1)
auc_dt_opt = metrics.auc(fpr_dt_opt, tpr_dt_opt)
In [ ]:
# Plot the 5 decision trees from earlier
for tree_index in range(4):
plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue')
plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='blue',
label='DT (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))
# Plot the optimized decision tree
plt.plot(fpr_dt_opt, tpr_dt_opt, color='orange', label='DT optimized (AUC = %0.2f)' % auc_dt_opt)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc="lower right")
Question How does the performance of decision trees compare to the performance of classifiers we have used previously on this data? Does this match your expectations?
We will resort to ensemble methods to try to improve the performance of single decision trees. Let us start with bagging trees: each tree is built on a bootstrap sample of the data, that is, a sample of the same size as the original data, drawn with replacement from it.
Note: Bagging trees and random forests only start to pay off when using a large number of trees (several hundred). This is computationally more intensive, especially when the number of features is large, as in this lab. To keep computation times reasonable, we suggest using small numbers of trees here; you may want to repeat this lab with larger numbers of trees at home.
Question Cross-validate a bagging ensemble of 5 decision trees on the data. Plot the resulting ROC curve, compared to the 5 decision trees you trained earlier.
In [ ]:
from sklearn import ensemble
# By default, the base estimator is a decision tree with default parameters
# TODO: Use clf = ensemble.BaggingClassifier(n_estimators=5)
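One possible completion (sketch); the variable names ypred_bag, fpr_bag, tpr_bag and auc_bag are just suggestions:
In [ ]:
clf = ensemble.BaggingClassifier(n_estimators=5)
ypred_bag = cross_validate(X, y, clf, folds)
fpr_bag, tpr_bag, thresholds_bag = metrics.roc_curve(y, ypred_bag, pos_label=1)
auc_bag = metrics.auc(fpr_bag, tpr_bag)

# Compare to the 5 individual decision trees trained earlier
for tree_index in range(5):
    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue')
plt.plot(fpr_bag, tpr_bag, '-', color='orange',
         label='Bagging 5 trees (AUC = %0.2f)' % auc_bag)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc="lower right")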
In [ ]:
Question Use cross_validate_optimize (as defined in the previous lab) to optimize the number of decision trees to use in the bagging method. How many trees did you find to be an optimal choice?
In [ ]:
def cross_validate_optimize(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation with an inner grid search and return the predictions.
Parameters:
-----------
design_matrix: (n_samples, n_features) np.array
Design matrix for the experiment.
labels: (n_samples, ) np.array
Vector of labels.
    classifier: sklearn GridSearchCV object
        GridSearchCV instance; must have the following methods/attributes:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X and return probability estimates
        - best_params_ the best parameter dictionary
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.
Return:
-------
pred: (n_samples, ) np.array
Vector of predictions (same order as labels).
"""
pred = np.zeros(labels.shape)
for tr, te in cv_folds:
# Restrict data to train/test folds
Xtr = design_matrix[tr, :]
ytr = labels[tr]
Xte = design_matrix[te, :]
#print Xtr.shape, ytr.shape, Xte.shape
# Fit classifier
classifier.fit(Xtr, ytr)
        # Print the best parameters found by the inner grid search on this fold
        print(classifier.best_params_)
# Predict probabilities (of belonging to +1 class) on test data
yte_pred = classifier.predict_proba(Xte)
index_of_class_1 = 1 - ytr[0] # 0 if the first sample is positive, 1 otherwise
pred[te] = yte_pred[:, index_of_class_1]
return pred
In [ ]:
param_grid = {'n_estimators': [5, 15, 25, 50]}
# TODO
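One possible way to run the optimization (sketch); ypred_bag_opt and related names are just suggestions:
In [ ]:
clf = grid_search.GridSearchCV(ensemble.BaggingClassifier(), param_grid,
                               scoring='roc_auc')
ypred_bag_opt = cross_validate_optimize(X, y, clf, folds)
fpr_bag_opt, tpr_bag_opt, thresholds = metrics.roc_curve(y, ypred_bag_opt, pos_label=1)
auc_bag_opt = metrics.auc(fpr_bag_opt, tpr_bag_opt)
print("Optimized bagging AUC: %.3f" % auc_bag_opt)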
Question Plot the ROC curve of the optimized cross-validated bagging tree classifier obtained with cross_validate_optimize, and compare it to the previous ROC curves (non-optimized bagging tree, decision trees).
In [ ]:
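# One possible plot (sketch); assumes the variable names fpr_bag, auc_bag,
# fpr_bag_opt, auc_bag_opt introduced in the sketches above.
for tree_index in range(5):
    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue')
plt.plot(fpr_bag, tpr_bag, '-', color='green',
         label='Bagging 5 trees (AUC = %0.2f)' % auc_bag)
plt.plot(fpr_bag_opt, tpr_bag_opt, '-', color='orange',
         label='Bagging optimized (AUC = %0.2f)' % auc_bag_opt)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc="lower right")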
Question Cross-validate a random forest of 5 decision trees on the data. Plot the resulting ROC curve, compared to the 5 decision trees you trained earlier, and the bagging tree made of 5 decision trees.
In [ ]:
clf = ensemble.RandomForestClassifier(n_estimators=5)
# TODO
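One possible completion (sketch); ypred_rf and related names are suggestions, and fpr_bag/auc_bag come from the bagging sketch above:
In [ ]:
clf = ensemble.RandomForestClassifier(n_estimators=5)
ypred_rf = cross_validate(X, y, clf, folds)
fpr_rf, tpr_rf, thresholds_rf = metrics.roc_curve(y, ypred_rf, pos_label=1)
auc_rf = metrics.auc(fpr_rf, tpr_rf)

for tree_index in range(5):
    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue')
plt.plot(fpr_bag, tpr_bag, '-', color='green',
         label='Bagging 5 trees (AUC = %0.2f)' % auc_bag)
plt.plot(fpr_rf, tpr_rf, '-', color='orange',
         label='Random forest of 5 trees (AUC = %0.2f)' % auc_rf)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc="lower right")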
In [ ]:
Question Use cross_validate_optimize (as defined in the previous lab) to optimize the number of decision trees to use in the random forest. How many trees do you find to be an optimal choice? How does the optimal random forest compare to the optimal bagging trees? How do the training times of the random forest and the bagging trees compare?
In [ ]:
param_grid = {'n_estimators': [5, 15, 25, 50]}
# TODO
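One possible way to run the optimization (sketch); ypred_rf_opt and related names are just suggestions:
In [ ]:
clf = grid_search.GridSearchCV(ensemble.RandomForestClassifier(), param_grid,
                               scoring='roc_auc')
ypred_rf_opt = cross_validate_optimize(X, y, clf, folds)
fpr_rf_opt, tpr_rf_opt, thresholds = metrics.roc_curve(y, ypred_rf_opt, pos_label=1)
auc_rf_opt = metrics.auc(fpr_rf_opt, tpr_rf_opt)
print("Optimized random forest AUC: %.3f" % auc_rf_opt)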
In [ ]:
Question How do your tree-based classifiers compare to logistic regression (regularized or not)? Plot ROC curves.
In [ ]:
from sklearn import linear_model
param_grid = {'C':[1e-3, 1e-2, 1e-1, 1., 1e2, 1e3]}
clf = grid_search.GridSearchCV(linear_model.LogisticRegression(penalty='l1'),
param_grid, scoring='roc_auc')
ypred_l1 = cross_validate_optimize(X, y, clf, folds)
fpr_l1, tpr_l1, thresholds_l1 = metrics.roc_curve(y, ypred_l1, pos_label=1)
clf = grid_search.GridSearchCV(linear_model.LogisticRegression(penalty='l2'),
param_grid, scoring='roc_auc')
ypred_l2 = cross_validate_optimize(X, y, clf, folds)
fpr_l2, tpr_l2, thresholds_l2 = metrics.roc_curve(y, ypred_l2, pos_label=1)
In [ ]:
# TODO
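# One possible comparison plot (sketch); assumes the variable names
# (fpr_dt_opt, auc_dt_opt, fpr_bag_opt, fpr_rf_opt, ...) from the sketches above.
plt.plot(fpr_dt_opt, tpr_dt_opt, color='blue',
         label='DT optimized (AUC = %0.2f)' % auc_dt_opt)
plt.plot(fpr_bag_opt, tpr_bag_opt, color='green',
         label='Bagging optimized (AUC = %0.2f)' % auc_bag_opt)
plt.plot(fpr_rf_opt, tpr_rf_opt, color='orange',
         label='RF optimized (AUC = %0.2f)' % auc_rf_opt)
plt.plot(fpr_l1, tpr_l1, color='magenta',
         label='L1 logistic regression (AUC = %0.2f)' % metrics.auc(fpr_l1, tpr_l1))
plt.plot(fpr_l2, tpr_l2, color='black',
         label='L2 logistic regression (AUC = %0.2f)' % metrics.auc(fpr_l2, tpr_l2))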
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves', fontsize=16)
plt.legend(loc="lower right")
You can find the documentation for tree-based regression here: http://scikit-learn.org/0.17/modules/tree.html#regression
In [ ]: